Snorkel: A System for Lightweight Extraction
Abstract
We describe a vision and an initial prototype system, called Snorkel, for extracting structured data from unstructured or "dark" input sources such as text, embedded tables, images, and diagrams. In Snorkel, users write traditional extraction scripts, which are automatically enhanced by machine learning techniques. The key technical idea is to view the user’s actions with standard tools as implicitly defining a statistical model. For example, to extract mentions of supplier-purchaser relations in SEC filings, a user of Snorkel might write several scripts that cross-reference against lists of company names, known supplier-purchaser relations, or specific textual patterns. Snorkel automatically assesses each script’s reliability for the task using only unlabeled data, integrates the scripts' outputs in a statistically sound way, and uses the combined signals to train a machine learning model with automatically generated features that performs the task more accurately and broadly. Compared to current machine-learning approaches to this task, Snorkel is our attempt to make an end-run around two major pain points: hand-labeling training data and feature engineering. More broadly, Snorkel is a first step toward our vision of a new generation of data systems that are observational: these systems will observe users using standard tools, and use machine learning techniques “behind the scenes” to improve their performance. In several preliminary hackathons, non-expert users from the biomedical domain have quickly neared or exceeded competition benchmarks, and Snorkel is now in use by a handful of technology companies, government organizations, and scientists.

Light-weight Macroscopic Analysis

Snorkel is intended for tasks in which users’ time and technical skills are limited, and the output schema is unknown or rapidly changing. Typically, dark data methods are deployed only in large corporations and government agencies due to their expense and high technical barrier to entry. Moreover, they are deployed only in situations in which a fixed, high-value schema is known in advance. Examples that do not fit the old paradigm include researchers looking for previously unnoticed drug interactions in electronic health records, government agencies and NGOs responding to disaster, and financial analysts poring over a trove of newly released earnings reports. In these scenarios, users have on the order of a week to write high-quality dark data extraction programs. Snorkel empowers them to write programs that are radically more robust and produce radically higher-quality data than even finely tuned regular expressions or Python scripts.

Snorkel’s Model

The Snorkel user interface is centered around writing labeling functions: pieces of code that heuristically label data according to the users’ desired output.
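As a rough illustration of the idea, the sketch below writes a few labeling functions for the supplier-purchaser example in plain Python. It does not use Snorkel's actual API: the label constants, the company lists, the function names, and the `label_candidate` helper are all hypothetical. The votes are combined by a simple majority vote, which is only a stand-in for the statistical model the abstract describes; unlike Snorkel, it does not estimate each function's reliability from unlabeled data.

```python
import re
from collections import Counter

# Hypothetical label convention for this sketch (not Snorkel's API).
SUPPLIES, NOT_SUPPLIES, ABSTAIN = 1, -1, 0

# Hypothetical reference data a user might cross-reference against.
KNOWN_COMPANIES = {"Acme Corp", "Globex", "Initech"}
KNOWN_SUPPLIER_PAIRS = {("Acme Corp", "Globex")}


def lf_known_pair(sentence, supplier, purchaser):
    """Vote SUPPLIES if the pair is in a list of known relations."""
    return SUPPLIES if (supplier, purchaser) in KNOWN_SUPPLIER_PAIRS else ABSTAIN


def lf_supply_pattern(sentence, supplier, purchaser):
    """Vote SUPPLIES if a textual pattern links supplier to purchaser."""
    pattern = re.escape(supplier) + r"\s+(supplies|sells to)\s+" + re.escape(purchaser)
    return SUPPLIES if re.search(pattern, sentence) else ABSTAIN


def lf_competitor_cue(sentence, supplier, purchaser):
    """Vote NOT_SUPPLIES if the sentence talks about competition."""
    return NOT_SUPPLIES if re.search(r"\bcompet(es|itor|ition)\b", sentence) else ABSTAIN


def lf_unknown_company(sentence, supplier, purchaser):
    """Vote NOT_SUPPLIES if either name is missing from the company list."""
    known = supplier in KNOWN_COMPANIES and purchaser in KNOWN_COMPANIES
    return ABSTAIN if known else NOT_SUPPLIES


LABELING_FUNCTIONS = [
    lf_known_pair,
    lf_supply_pattern,
    lf_competitor_cue,
    lf_unknown_company,
]


def majority_vote(votes):
    """Combine votes by unweighted majority; abstain on ties or no votes.

    Stand-in only: Snorkel instead weights each function by a reliability
    estimated from unlabeled data.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_candidate(sentence, supplier, purchaser):
    """Apply every labeling function to one candidate and combine the votes."""
    votes = [lf(sentence, supplier, purchaser) for lf in LABELING_FUNCTIONS]
    return majority_vote(votes)
```

Calling `label_candidate("Acme Corp supplies Globex with steel.", "Acme Corp", "Globex")` applies all four functions to one candidate pair and returns the combined label. Letting functions abstain keeps each heuristic narrow: a function votes only where its rule applies, and the combination step handles overlap and conflict.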
Similar resources
A Lightweight Intrusion Detection System Based on Specifications to Improve Security in Wireless Sensor Networks
Due to the prevalence of Wireless Sensor Networks (WSNs) in many mission-critical applications, such as military settings, security is considered one of the essential parameters of Quality of Service (QoS), and an Intrusion Detection System (IDS) is considered a fundamental requirement for security in these networks. This paper presents a lightweight Intrusion Detection System to prote...
Snorkel: Beyond Hand-labeled Data
This talk describes Snorkel, a software system whose goal is to make routine machine learning tasks dramatically easier. Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets for a user’s task. In Snorkel, a user implicitly defines large training sets by writing simple programs that create labeled data, instead of tediously hand-...
The “Oil-Spill Snorkel”: an innovative bioelectrochemical approach to accelerate hydrocarbons biodegradation in marine sediments
This study presents the proof-of-concept of the "Oil-Spill Snorkel": a novel bioelectrochemical approach to stimulate the oxidative biodegradation of petroleum hydrocarbons in sediments. The "Oil-Spill Snorkel" consists of a single conductive material (the snorkel) positioned suitably to create an electrochemical connection between the anoxic zone (the contaminated sediment) and the oxic zone (...
Wave Characteristics in Breaststroke Technique with and Without Snorkel Use
The purpose of this paper was to examine the characteristics of waves generated when swimming with and without the use of Aquatrainer® snorkels. Eight male swimmers performed two maximal bouts of 25 m breaststroke, first without the use of a snorkel (normal condition) and then using a snorkel (snorkel condition). The body landmarks, centre-of-mass velocity, stroke rate, stroke length, strok...
Effect of nasopharyngeal snorkel on respiratory function in patients with stroke
Stroke causes significant mortality and morbidity. The clinical value of the nasopharyngeal snorkel was investigated in stroke patients with disorders of consciousness. A total of 155 stroke patients were randomly divided into two groups: a nasopharyngeal snorkel was used in the treatment group (n=78) and an oropharyngeal snorkel was used in the control group (n=77). The PaO2 and PCO2 of both g...